Abstract

Modeling crowd behavior relies on accurate data of pedestrian movements 
at a high level of detail. Imaging sensors such as cameras provide a good 
basis for capturing such detailed pedestrian motion data. However, currently 
available computer vision technologies, when applied to conventional video 
footage, still cannot automatically unveil accurate motions of groups of peo- 
ple or crowds from the image sequences. We present a novel data collection 
approach for studying crowd behavior which uses the increasingly popular 
low-cost sensor Microsoft Kinect. The Kinect captures both standard cam- 
era data and a three-dimensional depth map. Our human detection and 
tracking algorithm is based on agglomerative clustering of depth data cap- 
tured from an elevated view - in contrast to the lateral view used for gesture 
recognition in Kinect gaming applications. Our approach transforms local 
Kinect 3D data to a common world coordinate system in order to stitch to- 
gether human trajectories from multiple Kinects, which allows for a scalable 
and flexible capturing area. At a testbed with real-world pedestrian traffic 
we demonstrate that our approach can provide accurate trajectories from 
three Kinects with a Pedestrian Detection Rate of up to 94% and a Multiple 
Object Tracking Precision of 4 cm. Using a comprehensive dataset of 2240 
captured human trajectories we calibrate three variations of the Social Force 
model. The results of our model validations indicate their particular ability 
to reproduce the observed crowd behavior in microscopic simulations. 

1. Introduction

With 60% of the world's population projected to live in urban areas by 
2030, crowd management and modeling is becoming an urgent issue of global 
concern. A better understanding of pedestrian movement can lead to an 
improved use of public spaces, to the appropriate dimensioning of urban 
infrastructure (such as airports, stations and commercial centers), and, most 
importantly, to the design of cities that are more responsive to people and 
to that very fundamental human activity - walking. 

At the urban block and building model scale, crowd movement is usually
predicted using microscopic pedestrian simulation models. The development
and calibration of such models require highly
accurate data on pedestrian movements. This data is provided by individ- 
ual movement trajectories in space. Modeling human interaction behavior 
calls for an analysis of all people in a given scene. At the same time, col- 
lecting large amounts of quantitative data on how people move in different 
environments is a very time-consuming and elaborate process.

Traditionally, such data is collected by manually annotating the positions
of people in individual frames of recorded video data of highly frequented
areas (Antonini et al. (2006); Berrou et al. (2007)). Sometimes, additional
attributes such as age or gender are assigned during the annotation process.
But manual annotation is particularly complex in dense scenes, which limits
the amount of data that can be analyzed. As a result, large scale data on
human motion can only be obtained from video, using tools for automatic
vision-based detection and tracking of pedestrians. Currently available
computer vision methods suffer from several limitations, such as occlusions
of static and moving objects, changing lighting conditions and background
variations. For example, Breitenstein et al. (2011) describe a tracking
approach that relies on two dimensional image information from a single,
uncalibrated camera, without any additional scene knowledge. While this
method shows an improved performance compared to other state-of-the-art
results, occlusions at higher densities of people lead to missing detections
or switching individuals.

In order to avoid severe occlusions most of today's commercially available 
people counter solutions use overhead sensors. Due to their restricted view, 
multiple sensors are required for a larger capturing area. These sensors are
often very expensive, so the observation of pedestrian movement on a larger 
spatial scale imposes high costs. Furthermore, commercial solutions usually 
do not provide access to trajectory data. 

Semi-automated video approaches are described in Plaue et al. (2011) and
Johansson and Helbing (2010). They are based on the manual annotation
of objects (e.g. people's heads) within very few images, which are then
provided as input to an algorithm that tracks across different frames. While 
such systems have clear advantages in analyzing simple, low density scenes, 
they suffer from both high manual effort and less robust automatic tracking 
in complex scenarios. 

Experimental setups represent another approach for collecting trajectories 
of individuals. In these setups participants can be equipped with distinctive 
wear such as colored hats for better identification. External factors such as 
lighting conditions can be controlled. As a result, the automated extraction 
of trajectories can be very robust. The free software PeTrack presented in
Boltes et al. (2010) has been applied to video recordings of a bottleneck
experiment. The automatic tracking approaches of Hoogendoorn and Daamen
(2003) and Hoogendoorn and Daamen (2005) collected trajectory data in
a narrow bottleneck and a four-directional crossing flow experiment.
Controlled experiments allow the setting of environmental conditions that
are hard to observe in real world circumstances, as in Daamen and
Hoogendoorn (2012) where emergency settings were reenacted including
acoustic and visual signals. However, these setups only allow for a limited
sample size and
include a significant bias in the data since participants are usually aware of 
being observed. 

In this paper we propose a novel approach for automatically collecting
highly accurate and comprehensive data on individual pedestrian movement
using the Microsoft Kinect - a motion sensing input device which was
originally developed for the Xbox 360 video game console (Microsoft Corp.
(2012a)). Our proposed use of the Kinect represents a very economical way
to collect movement data which overcomes many of the above described
limitations. Furthermore, thanks to the increased richness of the sensed
data in three dimensions, it can open the way to more sophisticated,
fine-grained analyses of crowd movement.

The Kinect is an inexpensive sensor that delivers not only camera infor- 
mation, but also a 3D depth map which is particularly useful for computer 
vision and pattern recognition purposes. The Kinect was originally designed 
to accurately detect three dimensional positions of body joints (Shotton et
al. (2011)) and to estimate human pose (Girshick et al. (2011)). Figure 1
illustrates the skeletal tracking which is the key component of the video
game user interface. With its built-in functionality, the Kinect can detect
up to six people (two of them using the skeletal tracking) provided that all
persons face the sensor in frontal view with their upper bodies visible.
Since its market introduction in 2010, the Kinect has also been used in a
broad variety of other research fields: Noonan et al. (2011) showed the use
of the Kinect for tracking body motions in clinical scanning procedures.
Animation of the hand avatar in a virtual reality setting by combining the
Kinect with wearable haptic devices was developed in Frati and Prattichizzo
(2011). The Kinect was used in Izadi et al. (2011) to create detailed three
dimensional reconstructions of an indoor scene. Weiss et al. (2011)
presented a method for human shape reconstruction using three dimensional
and RGB data provided by the Kinect.

To the best of our knowledge, the Microsoft Kinect has not yet been 
used to obtain data for modeling crowd behavior. For its purpose as a user 
interface, the Kinect has an implemented capability for skeletal tracking of 
individuals. However, this feature cannot be directly used for measuring 
crowd movement. Given the necessary conditions for its built-in people
detector, a single Kinect is not able to deliver stable detections of all
individuals in crowded scenes with more than six people, and mutual
occlusions severely affect the detection performance.





Figure 1: Microsoft Kinect provides the depth data stream (a), with the detected person 
in red and different gray levels encoding the depth information, and skeletal tracking (b). 

Figure 2: MIT's Infinite Corridor with (a) the observed area (green) and (b) the Kinect
setting on the ceiling. 



In this paper, we demonstrate how the Microsoft Kinect can be used to 
obtain tracking data for crowd modeling. It has the potential to become 
an invaluable tool for analyzing pedestrian movement, overcoming most of 
the limitations of hitherto used automated and semi-automated capturing 
systems. Furthermore, we aim to make the following detailed contributions: 

1. We present algorithms for processing depth data from multiple Kinects 
to retrieve pedestrian trajectories from an elevated view. 

2. We demonstrate the performance of our algorithms in a real world 
setup, also addressing the setting details and sensor calibration. 

3. We use an extensive data set derived from our approach to calibrate and 
compare state-of-the-art microscopic pedestrian simulation models. 

We combined three Kinect sensors and collected a large dataset on crowd 
movement inside the Massachusetts Institute of Technology (MIT)'s Infinite 
Corridor, the longest hallway that serves as the most direct indoor route 
between the east and west ends of the campus and is highly frequented by 
students and visitors. Figure 2a shows the area identified for the data
collection in this work, and Figure 2b shows the Kinect sensors mounted at
the ceiling. In order to observe various pedestrian behaviors we performed 
different walking experiments. 

This paper is structured as follows: Section 2 outlines the setting for
measuring human motion data using the Kinect. We also explain the
calibration process needed in order to derive world coordinate data from the
Kinect sensors. Furthermore, we describe the algorithms for detecting and
tracking humans using multiple Kinects. Section 3 provides evaluation
results showing the tracking performance in a setting with single and
multiple Kinects. Section 4 describes the walking experiments and data
collection at MIT's Infinite Corridor. We describe how these data sets can
be used for the calibration of crowd models and provide results from the
calibration and validation of three simulation models. Section 5 summarizes
the results and gives an outlook for further research.



2. Human Detection and Tracking 

Detailed knowledge of pedestrian flows is of vital importance for the cal- 
ibration and validation of microscopic pedestrian simulation models. The 
Kinect can be thought of as a modified camera. Like a traditional camera it 
provides a sequence of standard RGB color frames. In addition, it delivers 
a 3-dimensional depth image for each frame. The depth image of a scene 
tells us the distance of each point of that particular scene from the Kinect. 
Depth images and RGB color images are both accessible with the Kinect for
Windows SDK by Microsoft Corp. (2012b). Figure 3 illustrates a snapshot
of the depth image, the RGB image and a combination of depth and RGB
from three Kinects mounted at a height of 4.5 meters in a top view position
in MIT's Infinite Corridor. With this setup a section of 6 meters of the
corridor can be captured. Note that the glass case introduces a significant 
amount of artifacts due to specular reflections. In order to meet privacy 
concerns - most of the observed persons are not aware of any data collec- 
tion experiment - our approach does not process RGB information from the 
visible spectrum. 

In order to compute pedestrian trajectories from depth image sequences 
of multiple Kinects, it is necessary to 1) map depth information from indi- 
vidual Kinect sequences into a common world coordinate system, 2) group 
depth information from a single Kinect in the world coordinate system into 
individual pedestrians and track the pedestrians to obtain trajectories and 
3) stitch pedestrian trajectories from multiple Kinect sensors. These three 
steps are described in the following subsections. 

[Figure 3 panels; surviving labels: Sensor S2, Sensor S3]

Figure 3: Kinect sensor field of view; raw depth data stream (left), RGB stream (middle)
and both data streams in an overlay (right).



2.1. Obtaining World Coordinates 

A Kinect sensor $S_k$ from a set of $K$ devices generates a time series of
640 x 480 depth pixel images. Each depth image encodes a set of valid
three-dimensional points $\mathbf{x}_{c_i} = [x_{c_i}\ y_{c_i}\ z_{c_i}]^T$
with $i < 640 \times 480$, in the local Kinect 3D camera coordinate system,
computed with the value of the focal length $f$ provided by Microsoft Corp.
(2012b). The physical constraints of the Kinect 3D-measurement setup limit
the range of $z_{c_i}$ within which reliable depth data can be computed to a
maximum distance of 4 meters. Objects which are located more than 4 meters
away from the sensor cannot be captured.
A human trajectory $T$ is denoted as a sequence of $N$ four-dimensional
vectors

$$T = \{[t_i\ x_{w_i}\ y_{w_i}\ z_{w_i}]^T\}_{i=1 \ldots N}, \qquad (1)$$

where the vectors are composed of a timestamp $t_i$ and a 3D position
$\mathbf{x}_{w_i} = [x_{w_i}\ y_{w_i}\ z_{w_i}]^T$ in a common world
coordinate system. For a trajectory to represent people walking throughout
the sensing areas of multiple Kinect sensors, the points of the local 3D
coordinate systems of the mounted Kinect sensors must first be mapped to the
world coordinate system.

The actual point mapping between the coordinate system of sensor $S_k$
and the world coordinate system is represented by a rigid transformation,
composed of a translation vector $\mathbf{t}_k$ between the two origins of
the coordinate systems and a 3 x 3 rotation matrix $R_k$ such that

$$\mathbf{x}_{w_i} = R_k \mathbf{x}_{c_i} + \mathbf{t}_k. \qquad (2)$$

The three parameter values for the translation $\mathbf{t}_k$ of sensor
$S_k$ and its three rotation angles in $R_k$ are determined by collecting a
set of $M$ point matches $\langle \mathbf{x}_{w_i}, \mathbf{x}_{c_i} \rangle$,
$i \in \{1, \ldots, M\}$, and subsequently minimizing the error

$$E = \sum_{i=1}^{M} \left| \mathbf{x}_{w_i} - R_k \mathbf{x}_{c_i} - \mathbf{t}_k \right|^2 \qquad (3)$$

by solving an overdetermined equation system as described in Forsyth and
Ponce (2002).

We determine the $M$ point matches $\langle \mathbf{x}_{w_i}, \mathbf{x}_{c_i} \rangle$
in world coordinates $\mathbf{x}_{w_i}$ manually from the depth images. Since
only depth information and no visual information is available for the sensed
area, sensor calibration must be based on pre-determined calibration objects
with well-defined depth discontinuities. Our sensor calibration setup is
composed of a rectangular piece of cardboard placed on a tripod. The
reference points in world coordinates $\mathbf{x}_{w_i}$ are determined as
the center of gravity of the extracted cardboard corners in the depth
images. The raw depth data including the reference points and the results of
the calibration for all sensors are shown in Figure 4. Table 1 shows that
the Root-Mean-Square Error (RMSE) between the reference points in the world
coordinates and the reference points in camera coordinates transformed with
(2) lies within the range of a few centimeters.
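
To make the mapping (2) and the error (3) concrete, the following minimal sketch applies a rigid transformation to camera points and evaluates the squared error and the RMSE over a set of point matches. It does not perform the minimization itself. The function names, the Z-Y-X order in which the three rotation angles are composed, and the example values are assumptions made for illustration and are not taken from the calibration described above.

```python
import math


def rotation_matrix(rx, ry, rz):
    """3x3 rotation matrix R = Rz * Ry * Rx from three angles in radians.

    The Z-Y-X composition order is an assumption of this sketch.
    """
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def to_world(R, t, x_c):
    """Equation (2): x_w = R x_c + t for a single 3D point."""
    return [sum(R[r][c] * x_c[c] for c in range(3)) + t[r] for r in range(3)]


def calibration_error(R, t, matches):
    """Equation (3) and the RMSE for a list of (x_w, x_c) point matches."""
    E = 0.0
    for x_w, x_c in matches:
        x_hat = to_world(R, t, x_c)
        E += sum((a - b) ** 2 for a, b in zip(x_w, x_hat))
    return E, math.sqrt(E / len(matches))


# Hypothetical example: a sensor 4.5 m above the floor looking straight down
# (rotation by pi about the x axis), with two reference points in meters.
R = rotation_matrix(math.pi, 0.0, 0.0)
t = [0.0, 0.0, 4.5]
matches = [
    ([1.0, -2.0, 1.5], [1.0, 2.0, 3.0]),   # exact match
    ([0.0, 0.0, 1.03], [0.0, 0.0, 3.5]),   # 3 cm deviation in height
]
E, rmse = calibration_error(R, t, matches)
```

In a real calibration, the six parameters would be varied by a least-squares solver until `E` is minimal; the RMSE computed this way corresponds to the kind of value reported per sensor in Table 1.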

Figure 4: Left column - Raw data from sensor including reference points (red circles);
Right column - sensor calibration results with measured reference (blue) and estimated
(green) points.

Sensor    S1       S2       S3
RMSE      64 mm    67 mm    19 mm

Table 1: Accuracy of calibration computed on reference points.

2.2. Detection and Tracking Algorithm

Let $\mathcal{D}$ denote the set of points $\mathbf{x}_{w_i}$ obtained by
applying the rigid transform (2) to the 3D camera coordinates from a Kinect
depth image. The objective of human detection is to extract from
$\mathcal{D}$ connected sets of points belonging to a person and to
represent the person with a point $\mathbf{x}_p$. Human tracking associates
detections of individuals over time. Human detection is composed of the
following steps:

1. Data Reduction by Background Subtraction. Identifying a set 
of points which do not change or only change slowly over time - the 
background - supports the segmentation of walking persons from other 
objects and reduces the number of depth points to be processed. This 
can be achieved by classic background subtraction techniques from the 
domain of video analysis, e.g. the adaptive background modeling with
Gaussian Mixture Models described in Stauffer and Grimson (2000).
In our particular case of the Infinite Corridor, the background model 
is handcrafted, since the locations of background objects such as walls 
are well-known in advance. 

2. Data Reduction by Cutoff. The cutoff step first removes all 3D
points which remain after background subtraction with height $z_{w_i}$ larger
than a tall person's height, e.g. 2.1 meters for adults, and all 3D
points with height $z_{w_i}$ smaller than a typical upper body region, e.g.
1.5 meters. The second cutoff value determines the minimal height of
detectable persons, and is necessary to exclude noisy measurements of
objects near the floor. Applying the cutoff values to $z_{w_i}$ results in a
subset $\mathcal{D}'$.

3. Hierarchical Clustering on the Reduced Set. In order to group
the points $\mathcal{D}'$ into natural clusters corresponding to individual
persons, we first build a cluster tree by agglomerative clustering with the
complete-linkage algorithm (Duda et al. (2001)). For computational
reasons we randomly select a subset $\mathcal{D}''$ of $R$ points from
$\mathcal{D}'$ for clustering, where typically $R = 500$. The
complete-linkage algorithm uses the following distance
$d(\mathcal{D}''_i, \mathcal{D}''_j)$ to measure the dissimilarity between
subsets of $\mathcal{D}''$:

$$d(\mathcal{D}''_i, \mathcal{D}''_j) = \max_{\mathbf{x} \in \mathcal{D}''_i,\ \mathbf{x}' \in \mathcal{D}''_j} \|\mathbf{x} - \mathbf{x}'\|, \qquad (4)$$

with $\|\cdot\|$ as the Euclidean distance. Using metric (4) avoids elongated
clusters and is advantageous when the true clusters are compact and
roughly equal in size (Duda et al. (2001)). All leaves at or below a
node with a height less than a threshold are grouped into a cluster,
where the threshold is based on a typical human shoulder width, e.g.
0.6 meters.

4. Grouping of $\mathcal{D}$ and Cleanup. All available observation points of
$\mathcal{D}$ are assigned to a cluster, given that they are sufficiently
close to the cluster center. Otherwise they are removed. Small clusters which
originate from noise or people on the border of the field of view are
removed.

5. Identifying a Cluster Representative. For every cluster
$\mathcal{D}''_i$, the point $\mathbf{x}_{p_i}$ representing the pedestrian
location of a trajectory (1) is selected as the point with the 95th
percentile of the height $z_w$ in $\mathcal{D}''_i$, defined as the person's
height.
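
The detection steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: background subtraction (step 1) is assumed to have been applied already, the naive complete-linkage routine stands in for an optimized library, points are re-assigned from the cutoff set, and the closeness threshold, `min_points` and all function names are hypothetical choices.

```python
import math
import random


def complete_linkage_clusters(points, threshold):
    """Agglomerative clustering with complete linkage (equation (4)).

    Merging stops once the smallest inter-cluster distance is no longer
    below `threshold`. Returns lists of indices into `points`.
    """
    clusters = {i: [i] for i in range(len(points))}
    dist = {(i, j): math.dist(points[i], points[j])
            for i in clusters for j in clusters if i < j}
    while len(clusters) > 1:
        (a, b), d_min = min(dist.items(), key=lambda kv: kv[1])
        if d_min >= threshold:
            break
        clusters[a].extend(clusters.pop(b))
        for k in clusters:
            if k == a:
                continue
            # Complete linkage: distance to the merged cluster is the maximum.
            d_ak = dist.pop((min(a, k), max(a, k)))
            d_bk = dist.pop((min(b, k), max(b, k)))
            dist[(min(a, k), max(a, k))] = max(d_ak, d_bk)
        del dist[(a, b)]
    return list(clusters.values())


def detect_people(points, z_min=1.5, z_max=2.1, width=0.6,
                  sample_size=500, min_points=5, seed=0):
    """Return one representative (x, y, z) point per detected person."""
    # Step 2: keep only points in the head and upper-body height band.
    reduced = [p for p in points if z_min <= p[2] <= z_max]
    # Step 3: cluster a random subset of at most `sample_size` points.
    rng = random.Random(seed)
    sample = rng.sample(reduced, min(sample_size, len(reduced)))
    groups = complete_linkage_clusters(sample, width)
    centers = [tuple(sum(sample[i][d] for i in g) / len(g) for d in range(3))
               for g in groups]
    # Step 4: assign each point to its nearest cluster center if it is close
    # enough (assumption: the cutoff set and `width` as closeness threshold).
    members = [[] for _ in centers]
    for p in reduced:
        k = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        if math.dist(p, centers[k]) <= width:
            members[k].append(p)
    # Step 5: drop small clusters, pick the point at the 95th height percentile.
    detections = []
    for pts in members:
        if len(pts) < min_points:
            continue
        pts.sort(key=lambda p: p[2])
        detections.append(pts[int(0.95 * (len(pts) - 1))])
    return detections


# Hypothetical example: two people, one floor point and one isolated noise point.
person_a = [(0.05 * i, 0.0, 1.60 + 0.01 * i) for i in range(6)]
person_b = [(2.0 + 0.05 * i, 1.0, 1.70 + 0.01 * i) for i in range(6)]
noise = [(1.0, 1.0, 0.1), (5.0, 5.0, 1.7)]
detections = sorted(detect_people(person_a + person_b + noise, min_points=3))
```

Stopping the merge loop at the shoulder-width threshold is equivalent to cutting the complete-linkage tree at that height, since merge distances under complete linkage never decrease.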

This process provides robust people detections of all individuals in a single
depth image. In order to obtain correspondences of multiple people over
consecutive frames and hence trajectories $T$ as defined in (1), a simple
nearest neighbor matching is used in conjunction with a linear extrapolation
from preceding frames. While other applications use more complex approaches
for object tracking (see Berclaz et al. (2011) for an overview), we take
advantage of the high rate of 30 frames per second provided by the Kinect.
Our linear extrapolation predicts the position of individuals using the
previous $n$ frames, where we chose $n = 5$. Having the predicted location,
we search for the nearest individual within a certain spatial and temporal
threshold. Figure 5 shows the tracking results of a short sequence.
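
A frame-by-frame tracker of this kind could look roughly like the sketch below. It is an illustration only: the extrapolation uses the mean velocity between the first and last of the previous n points, the association is greedy per track, and the thresholds `max_dist` and `max_gap` as well as the function names are assumed values rather than the settings used in this work.

```python
import math


def predict(track, t_next, n=5):
    """Linearly extrapolate the (x, y) position of a track to time t_next.

    `track` is a list of (t, x, y); the velocity is estimated from the
    first and last of the previous n points (an assumption of this sketch).
    """
    recent = track[-n:]
    t1, x1, y1 = recent[-1]
    if len(recent) < 2:
        return x1, y1
    t0, x0, y0 = recent[0]
    vx = (x1 - x0) / (t1 - t0)
    vy = (y1 - y0) / (t1 - t0)
    return x1 + vx * (t_next - t1), y1 + vy * (t_next - t1)


def update_tracks(tracks, detections, t, max_dist=0.5, max_gap=0.5, n=5):
    """Associate the detections (x, y) of one frame with existing tracks."""
    unmatched = list(detections)
    for track in tracks:
        # Temporal threshold: ignore tracks that have not been seen recently.
        if not unmatched or t - track[-1][0] > max_gap:
            continue
        px, py = predict(track, t, n)
        best = min(unmatched, key=lambda d: math.hypot(d[0] - px, d[1] - py))
        # Spatial threshold: accept only a sufficiently close detection.
        if math.hypot(best[0] - px, best[1] - py) <= max_dist:
            track.append((t, best[0], best[1]))
            unmatched.remove(best)
    # Every detection left over starts a new track.
    tracks.extend([(t, x, y)] for x, y in unmatched)
    return tracks


# Hypothetical example: two people walking in opposite directions at 1.3 m/s,
# observed for ten frames at 30 frames per second.
tracks = []
for k in range(10):
    t = k / 30.0
    update_tracks(tracks, [(1.3 * t, 0.0), (3.0 - 1.3 * t, 0.8)], t)
```

At 30 frames per second a pedestrian moves only a few centimeters between frames, which is why such a simple nearest neighbor rule can be sufficient.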

2.3. Tracking over Multiple Sensor Views 

Combining trajectories from multiple Kinect sensors enables the observa- 
tion of pedestrian movement on a larger spatial scale, which yields a richer 
data set for crowd modeling. Such a combination, or stitching of an in- 
dividual's trajectories over multiple sensors requires a correct association of 
trajectory data from the different sensors $S_k$. We apply a trajectory
stitching approach inspired by Stauffer (2003) and Kaucic et al. (2005).

We denote $\mathbf{x}_{p_1} = [t_1\ x_{p_1}\ y_{p_1}\ z_p]^T$ as the first
point of a pedestrian trajectory $T_i$, and
$\mathbf{x}'_{p_N} = [t'_N\ x'_{p_N}\ y'_{p_N}\ z'_p]^T$ as the last point
of a pedestrian trajectory $T_j$, where $z_p$ and $z'_p$ are the pedestrian
height information averaged over the respective trajectory. The Euclidean
distance

$$d_{ij} = d(T_i, T_j) = \| \mathbf{x}_{p_1} - \mathbf{x}'_{p_N} \| \qquad (5)$$

then gives an expression of the dissimilarity between trajectory end points,
i.e. how unlikely it is that $T_i$ and $T_j$ were generated from the same
person. Time information is expressed in seconds, and the world coordinates
are denoted in meters. The four-dimensional features are therefore already
in the same scale and need not be normalized. Given two trajectory sets of
two Kinect sensors, the distance (5) is used to build a square distance
matrix

$$D = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{m1} & d_{m2} & \cdots & d_{mn} \end{bmatrix}. \qquad (6)$$

Here, a value larger than $\max_{(i,j)}(d_{ij})$ denotes the null match, used
when $m > n$, i.e. when the number of trajectories to match between two
Kinect sensors is different.

Figure 5: Kinect depth raw data in 3D (walls are light gray and detected objects are dark
gray) with automatically obtained trajectories (red).

Pairwise stitching of start and end points of trajectories to combine them
to a longer trajectory can be expressed as a bipartite graph matching
problem, which can be solved by the Hungarian algorithm (see Munkres
(1957)). We consider distances from $D$ as weights of a complete weighted
bipartite graph. From all possible trajectory matchings $C$ the Hungarian
algorithm solves this assignment problem by finding the optimal assignment
$C_{\mathrm{opt}}$ given by

$$C_{\mathrm{opt}} = \arg\min_{C} \sum_{(i,j) \in C} d_{ij}. \qquad (7)$$

The global optimization of the Hungarian algorithm associates all trajectory
pairs, even those with very high $d_{ij}$. However, large distances $d_{ij}$
are very likely the result of interrupted or short trajectories caused by
detection or tracking errors. We therefore apply an iterative approach as
follows:

1. Take into account only trajectories for which elements in the association
matrix $D$ in (6) are lower than threshold $h$. This provides trajectory
assignments with a very high likelihood of being correct.

2. Remove already assigned trajectories from $D$, increase $h$ and calculate
the assignment with the remaining trajectories.

3. Repeat step 2 until $h$ has reached an upper boundary. Trajectories
which are left without assignment cannot be matched.

After the trajectory matching from all sensor views, we perform a
resampling and smoothing on the combined trajectories based on a cubic spline
approximation. Having the Kinect sensors in a slightly overlapping setting
provides a more robust similarity measure in terms of spatio-temporal
relationship.

3. Tracking Evaluation

Real data for pedestrian simulation calibration is often confined to
trajectories which have been manually extracted from video data sets. The
reason is that the required accuracy of the trajectories is very high, and
often only manually extracted trajectories can fulfill such accuracy
requirements. It is thus necessary to compare the output of the Kinect
pedestrian tracking described above with the "gold standard" of manually
generated trajectories in order to assess how suitable the automatic
collection of very large data sets is.

3.1. Performance Evaluation of Single Sensor People Tracking 

A human observer annotated the locations of all individuals in single 
frames using the raw depth sensor data from the Kinect. While the Kinect's 
depth data does not allow for identifying persons, the body shape of individ- 
uals is still recognizable. Our evaluation data is composed of two trajectory 
sets: the first data set comprises 15578 frames with pedestrian flows of low
to medium density, i.e. up to 0.5 persons/m², and a total number of 128
persons. The second sequence includes 251 frames with a total number of
21 persons and comparably higher densities of up to 1 person/m². Figure 6
illustrates a single frame from the second dataset.




Figure 6: Kinect depth raw data in 3D (gray) with manually annotated head positions of 
individuals (red circles). 



In a first step, every automatically computed trajectory $T$ is assigned
to a ground truth trajectory $T_g$ by minimizing a trajectory distance
metric. Quantifying the pairwise trajectory dissimilarity in a distance
metric is not trivial due to the usually different number of points. Here we
used the discrete Fréchet distance (Eiter and Mannila (1994)). Following an
informal interpretation, the Fréchet distance between two trajectories is
the minimum length of a leash that allows a dog and its owner to walk along
their respective trajectories, from one end to the other, without
backtracking. Taking into account the location and ordering of points along
the trajectories, the Fréchet distance is well-suited for the comparison of
trajectories and is less sensitive to outlier points than alternatives for
arbitrary point sets such as the Hausdorff distance.

Figure 7: Tracking performance evaluation using ground truth (green) and automatic
trajectories (magenta) including (a) 128 persons with up to 0.5 persons/m² and (b) 21 persons
with up to 1 person/m².
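
For readers unfamiliar with the discrete Fréchet distance, the following short sketch shows the usual dynamic-programming formulation in the spirit of Eiter and Mannila (1994). The function name and the example trajectories are chosen for illustration; trajectories are assumed to be given as lists of coordinate tuples and may differ in length.

```python
import math


def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two point sequences P and Q.

    ca[i][j] holds the coupling distance of the prefixes P[:i+1] and Q[:j+1].
    """
    n, m = len(P), len(Q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance on P, on Q, or on both; keep the cheapest history.
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1],
                                   ca[i][j - 1]), d)
    return ca[-1][-1]


# Hypothetical example: an automatic trajectory sampled more sparsely than
# the annotated one, and a parallel trajectory offset by one meter.
sparse = [(0.0, 0.0), (2.0, 0.0)]
dense = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
shifted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
d_sparse = discrete_frechet(sparse, dense)
d_shifted = discrete_frechet(dense, shifted)
```

An automatic trajectory can then be assigned to the ground truth trajectory for which this distance is smallest.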

As a result of the trajectory assignment we derive a set of $P$ matching
trajectory pairs for a time stamp $t$. Any remaining automatically computed
trajectories which could not be matched are considered as false positives.
Similarly, any remaining ground truth trajectories which could not be
matched are considered as misses. Figure 7 shows the results based on
trajectories from both sequences. Our dataset produced zero false positives
and one miss. The data showed that this missed person was shorter than the
defined cutoff value of 1.5 meters. In order to quantify the position error
for all correctly tracked objects over all frames, we use the Multiple
Object Tracking Precision (MOTP) as described in Bernardin and Stiefelhagen
(2008), which is defined as

$$Q_{\mathrm{MOTP}} = \frac{\sum_{i,t} d_t^i}{\sum_t c_t}, \qquad (8)$$

where $c_t$ is the number of matches found for time $t$. For each of these
matches, $d_t^i$ denotes the discrete Fréchet distance between the automatic
and the ground truth trajectory.

The Pedestrian Detection Rate (PDR) measures the rate at which tracked
pedestrians are matched to the ground truth. The value of PDR varies
between 0 and 1. While 0 means poor pedestrian detection, 1 means that all
ground truth pedestrians are matched. The metric is given by

$$Q_{\mathrm{PDR}} = \frac{TP}{TP + FN}, \qquad (9)$$

where the number of matched ground truth pedestrians is denoted by true
positives $TP$. False negatives $FN$ state the number of missing detections.
Table 2 provides the evaluation results for our detection and tracking
approach. Based on the PDR, our approach performs well on both sequences,
with detection rates of 96.20% and 93.86%. The localization errors stated
by the MOTP are also low, at 41.3 mm and 34.0 mm.





Sequence      Q_PDR      Q_MOTP
Sequence 1    96.20%     41.3 mm
Sequence 2    93.86%     34.0 mm

Table 2: Tracking evaluation results, showing Pedestrian Detection Rate (PDR) and Multiple
Object Tracking Precision (MOTP).
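
As a small worked example of the two metrics, the sketch below computes the MOTP as the sum of all per-match distances divided by the total number of matches, and the PDR as the share of matched ground truth pedestrians. The function names and the numbers are invented for illustration and are unrelated to the values in Table 2.

```python
def motp(distances_per_time):
    """MOTP: sum of match distances over all times / total number of matches.

    `distances_per_time` holds, for each time stamp, the list of distances
    of the matches found at that time.
    """
    total = sum(sum(ds) for ds in distances_per_time)
    matches = sum(len(ds) for ds in distances_per_time)
    return total / matches


def pdr(tp, fn):
    """PDR: matched ground truth pedestrians / all ground truth pedestrians."""
    return tp / (tp + fn)


# Hypothetical example: three matches over two time stamps (distances in m),
# and three matched pedestrians with one missed detection.
q_motp = motp([[0.04, 0.02], [0.03]])
q_pdr = pdr(tp=3, fn=1)
```

With these inputs the MOTP is 0.09 m / 3 = 0.03 m and the PDR is 3 / 4 = 0.75.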



3.2. Trajectory Stitching Performance 

The performance evaluation of our method for combining trajectories
from multiple Kinect sensors is based on ground truth data with manually
associated trajectories. We randomly selected two sets of automatically
obtained trajectories originating from two Kinect sensors $S_1$ and $S_2$:
the first data set includes 453 trajectories from $S_1$ and 442 trajectories
from $S_2$. Here, a subset of 119 trajectories was manually assigned,
serving as ground truth data. The second data set comprises 1402
trajectories from $S_1$ and 1423 trajectories from $S_2$ with a manually
assigned subset of 50 trajectories. A selection of ten trajectories from
Kinect sensors $S_1$ and $S_2$ is shown in Figure 8. In this sensor setting,
the trajectories are slightly overlapping, which allows us to derive a more
robust similarity measure.

Figure 8: Association of trajectories from Kinect sensor $S_1$ (red) and $S_2$ (blue).

As described in Section 2.3, trajectories from both data sets were
automatically combined by applying the Hungarian algorithm in an iterative
manner. For both subsets we then compared the assigned trajectories derived
by the automatic approach with the manual annotation. It turns out that
increasing the threshold $h$ reduces the assignment quality, which is
documented by the True Positive Ratio (TPR) in Table 3. This confirms our
assumption that erroneous trajectories decrease the assignment quality when
applying a global optimization with the Hungarian algorithm. However, the
results can be significantly improved by restricting the assignment to
trajectories within a lower threshold $h$ only. Iteratively increasing $h$
makes it possible to combine even severely interrupted or short trajectories.
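
The iterative assignment from Section 2.3 can be sketched as follows. This is a simplified illustration, not the code used for the evaluation: `hungarian` is a compact textbook implementation of the Hungarian method for square cost matrices, and the threshold schedule, the `null_cost` value used for null matches and the function names are assumptions. Trajectory end points and start points are given as four-dimensional vectors of time, position and averaged height.

```python
import math


def hungarian(cost):
    """Minimum-cost assignment for a square matrix; returns (row, col) pairs."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (n + 1)          # column potentials
    p = [0] * (n + 1)            # p[j]: row currently assigned to column j
    way = [0] * (n + 1)
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:           # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, n + 1)]


def stitch(ends, starts, thresholds=(3.0, 6.0, 9.0), null_cost=1e6):
    """Iteratively match trajectory ends to trajectory starts.

    Returns (end_index, start_index) pairs; unmatched items are left out.
    """
    free_e, free_s = list(range(len(ends))), list(range(len(starts)))
    pairs = []
    for h in thresholds:                      # increase h step by step
        if not free_e or not free_s:
            break
        size = max(len(free_e), len(free_s))  # pad to square with null matches
        cost = [[null_cost] * size for _ in range(size)]
        for r, i in enumerate(free_e):
            for c, j in enumerate(free_s):
                d = math.dist(ends[i], starts[j])
                if d < h:                     # only distances below h count
                    cost[r][c] = d
        new = [(free_e[r], free_s[c]) for r, c in hungarian(cost)
               if cost[r][c] < null_cost]
        pairs.extend(new)
        free_e = [i for i in free_e if i not in {a for a, _ in new}]
        free_s = [j for j in free_s if j not in {b for _, b in new}]
    return pairs


# Hypothetical example with [t, x, y, z] vectors (seconds and meters).
ends = [[10.0, 0.0, 0.0, 1.7], [12.0, 0.0, 1.0, 1.8], [50.0, 0.0, 0.0, 1.6]]
starts = [[12.2, 0.1, 1.0, 1.8], [10.3, 0.1, 0.0, 1.7]]
pairs = sorted(stitch(ends, starts))
```

In this example the first two end points are matched to the start points closest in time, position and height, while the third trajectory has no plausible partner and stays unmatched.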



Threshold h    Subset 1 TPR    Subset 2 TPR
3              98.00%          99.16%
6              98.00%          99.16%
9              90.00%          99.16%
12             76.00%          89.08%
15             80.00%          91.60%
18             82.00%          89.92%
21             84.00%          89.92%
23             70.00%          86.55%

Table 3: Evaluation results for stitching performance.

4. Calibration of Crowd Models 

Crowd behavior models are used to simulate and predict how humans
move around in different environments such as buildings or public spaces. In
order to reflect realistic behavior, crowd behavior models must rely on
empirical observations which ideally include a broad variety of human
walking behavior. We performed a variety of walking experiments at MIT's
Infinite Corridor described in Section 2 while capturing depth image
sequences of three Kinect sensors. Applying the people tracking algorithm of
Section 2 to the collected Kinect data sets left us with a comprehensive set
of robust trajectories. These trajectories provide the necessary information
for calibrating different types of microscopic pedestrian simulation models.
In the following we present experimental results of comparing three
variations of the Social Force Model (see Helbing and Molnar (1995)) based
on our data collected for calibrating these models.

Figure 9: Trajectories for the calibration of crowd models automatically retrieved from
(a) experiment 1 and (b) experiment 2 (walking directions are encoded in red and blue).

4.1. Walking Experiments

We performed the walking experiments under real world conditions, meaning
that the individuals crossing MIT's Infinite Corridor had no information
about being observed. The main task was to calibrate different microscopic
pedestrian simulation models on the operational level with relatively simple
scenarios, which allow the tactical level, such as route choice, to be
neglected.

Figure 10: Walking speed distribution from (a) experiment 1 and (b) experiment 2.



In the first walking experiment, a person standing in the center of the
observed area served as an obstacle for passing people. The 558 trajectories
of this setting were recorded during a period of approximately 28 minutes
(see Figure 9a). The second walking experiment includes "normal" walking
behavior without any external influence for a time span of around one hour.
The 1682 trajectories computed with our Kinect approach are illustrated in
Figure 9b. The red and blue trajectories in Figures 9a and 9b represent the
two walking lanes in opposite directions which people form most of the time.



Figures 10a and 10b show the walking speed histograms computed from
the trajectories of the two calibration data sets (the velocity of the person
acting as an obstacle in experiment 1 is filtered out). For experiment 1,
fitting a Gaussian function to the data yields a mean speed of 1.34 m/s and a
standard deviation of 0.25 m/s. Experiment 2 shows similar results for the
walking speed distribution, with a mean speed of 1.29 m/s and a standard
deviation of 0.33 m/s.
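Speed statistics of this kind can be derived from trajectory data along the
following lines. This is a minimal sketch and not the code used for the
paper: it assumes that each trajectory is a list of (t, x, y) samples in
seconds and meters, takes a pedestrian's speed as path length divided by
elapsed time, and uses the sample mean and standard deviation, which are the
maximum-likelihood parameters of a Gaussian. The function names and the toy
data are hypothetical.

```python
import math
import statistics

def mean_speed(trajectory):
    """Average speed (m/s) of one trajectory given as [(t, x, y), ...]."""
    path = sum(
        math.hypot(x1 - x0, y1 - y0)
        for (_, x0, y0), (_, x1, y1) in zip(trajectory, trajectory[1:])
    )
    duration = trajectory[-1][0] - trajectory[0][0]
    return path / duration

def fit_gaussian(speeds):
    """Maximum-likelihood Gaussian fit: (mean, standard deviation)."""
    return statistics.fmean(speeds), statistics.pstdev(speeds)

# Hypothetical toy data: three pedestrians walking straight along x.
trajectories = [
    [(0.0, 0.0, 0.0), (1.0, 1.2, 0.0), (2.0, 2.4, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 1.4, 1.0), (2.0, 2.8, 1.0)],
    [(0.0, 5.0, 2.0), (1.0, 3.5, 2.0), (2.0, 2.0, 2.0)],
]
speeds = [mean_speed(tr) for tr in trajectories]
mu, sigma = fit_gaussian(speeds)
print(f"mean speed {mu:.2f} m/s, standard deviation {sigma:.2f} m/s")
```

A histogram-based least-squares fit would be an alternative; for clean,
roughly symmetric data both approaches should give similar parameters.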

4.2. Pedestrian Simulation Model Description

The models for the simulations in this work are all based on the Social
Force model as presented in Helbing and Molnar (1995). Given that movement
depends on velocity and hence on acceleration, the principle of the
Social Force model aims at representing individual walking behavior as a
sum of different accelerations as

$$ \mathbf{f}_\alpha(t) = \frac{1}{\tau_\alpha}\left(v_\alpha^0 \mathbf{e}_\alpha - \mathbf{v}_\alpha\right) + \sum_{\beta \neq \alpha} \mathbf{f}_{\alpha\beta}(t) + \sum_{i} \mathbf{f}_{\alpha i}(t) \tag{10} $$



The acceleration $\mathbf{f}_\alpha$ at time $t$ of an individual $\alpha$ towards a certain goal is
defined by the desired direction of movement $\mathbf{e}_\alpha$ with a desired speed $v_\alpha^0$.
Here, the current velocity $\mathbf{v}_\alpha$ is adapted to the desired speed $v_\alpha^0$ within a
certain relaxation time $\tau_\alpha$. The movement of a pedestrian $\alpha$ is influenced
by other pedestrians $\beta$, which is modeled as a repulsive acceleration $\mathbf{f}_{\alpha\beta}$. A
similar repulsive behavior for static obstacles $i$ (e.g. walls) is represented by
the acceleration $\mathbf{f}_{\alpha i}$. For notational simplicity, we omit the dependence on
time $t$ for the rest of the paper.
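The structure of (10) can be sketched in a few lines: a driving term that
relaxes the current velocity towards the desired one, plus a plain sum of
repulsive accelerations. The sketch below is illustrative only. The function
names are hypothetical, vectors are 2D tuples, the repulsive terms are passed
in as precomputed values, and the numbers in the example (including the
relaxation time of 0.5 s) are made up.

```python
def driving_acceleration(v, e, v0, tau):
    """Relax the current velocity v towards the desired velocity v0 * e."""
    return ((v0 * e[0] - v[0]) / tau, (v0 * e[1] - v[1]) / tau)

def total_acceleration(v, e, v0, tau, repulsions):
    """Driving term plus the sum of all repulsive accelerations, as in (10)."""
    ax, ay = driving_acceleration(v, e, v0, tau)
    for fx, fy in repulsions:
        ax += fx
        ay += fy
    return ax, ay

# Hypothetical values: a pedestrian walking at 1.0 m/s along x who wants to
# walk at 1.34 m/s, with one other pedestrian pushing back and sideways.
a = total_acceleration(
    v=(1.0, 0.0), e=(1.0, 0.0), v0=1.34, tau=0.5,
    repulsions=[(-0.2, 0.1)],
)
print(a)
```

When the pedestrian already moves at the desired velocity, the driving term
vanishes and only the repulsive terms remain.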

There exist several different formulations of the Social Force model in the
literature. We compare three variations of the Social Force model based on
the general formulation (10).



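Model A: The first model from Helbing and Molnar (1995) is based on
a circular specification of the repulsive force given as

$$ \mathbf{f}^A_{\alpha\beta} = a_\alpha \exp\left(\frac{r_\alpha + r_\beta - \|\mathbf{d}_{\alpha\beta}\|}{b_\alpha}\right) \frac{\mathbf{d}_{\alpha\beta}}{\|\mathbf{d}_{\alpha\beta}\|} \tag{11} $$

where $r_\alpha$ and $r_\beta$ denote the radii of pedestrians $\alpha$ and $\beta$, and $\mathbf{d}_{\alpha\beta}$ is the
distance vector pointing from pedestrian $\alpha$ to $\beta$. The interaction of pedestrian
$\alpha$ is parameterized by the strength $a_\alpha$ and the range $b_\alpha$, whose values
need to be found in the calibration process.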
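A circular repulsion of the kind given in (11) can be sketched as follows.
The sketch is an illustration under assumptions, not the calibrated model:
the function name is hypothetical, the parameter values in the example are
made up, and the direction is explicitly taken as the unit vector pointing
from the other pedestrian towards the affected one so that the resulting
acceleration is repulsive.

```python
import math

def circular_repulsion(x_a, x_b, r_a, r_b, a, b):
    """Acceleration on pedestrian alpha caused by pedestrian beta.

    The magnitude decays exponentially with the gap between the two body
    circles. The direction points from beta to alpha (an assumption of this
    sketch, chosen so that alpha is pushed away from beta).
    """
    dx, dy = x_a[0] - x_b[0], x_a[1] - x_b[1]
    dist = math.hypot(dx, dy)
    magnitude = a * math.exp((r_a + r_b - dist) / b)
    return magnitude * dx / dist, magnitude * dy / dist

# Hypothetical example: alpha at the origin, beta one meter ahead on the x axis.
force = circular_repulsion((0.0, 0.0), (1.0, 0.0), 0.2, 0.2, a=1.0, b=0.5)
print(force)
```

Because the force depends on distance only, it acts with the same strength
on pedestrians who approach each other and on pedestrians who walk side by
side at the same distance.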

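Model B: The second model uses the elliptical specification of the
repulsive force as described in Helbing and Johansson (2009), determined by

$$ \mathbf{f}^B_{\alpha\beta} = a_\alpha \exp\left(-\frac{w_{\alpha\beta}}{b_\alpha}\right) \frac{\mathbf{d}_{\alpha\beta}}{\|\mathbf{d}_{\alpha\beta}\|} \tag{12} $$

where the semi-minor axis $w_{\alpha\beta}$ of the elliptic formulation is given by

$$ w_{\alpha\beta} = \frac{1}{2}\sqrt{\left(\|\mathbf{d}_{\alpha\beta}\| + \|\mathbf{d}_{\alpha\beta} - (\mathbf{v}_\beta - \mathbf{v}_\alpha)\Delta t\|\right)^2 - \|(\mathbf{v}_\beta - \mathbf{v}_\alpha)\Delta t\|^2} \tag{13} $$

Here, the velocity vectors $\mathbf{v}_\alpha$ and $\mathbf{v}_\beta$ of pedestrians $\alpha$ and $\beta$ are included,
which allows the step size of pedestrians to be taken into account.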
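The effect of the velocity dependence can be made concrete with a small
sketch of the elliptical specification in the form given by Helbing and
Johansson (2009). It is a hedged illustration: the function names, the time
step dt and all numbers are assumptions, and the distance vector d is taken
here to point from the other pedestrian towards the affected one so that
the result is repulsive.

```python
import math

def semi_minor_axis(d, v_a, v_b, dt):
    """Semi-minor axis w of the ellipse for distance vector d, as in (13)."""
    # Relative displacement of the two pedestrians during one step of length dt.
    sx, sy = (v_b[0] - v_a[0]) * dt, (v_b[1] - v_a[1]) * dt
    norm_d = math.hypot(d[0], d[1])
    norm_shifted = math.hypot(d[0] - sx, d[1] - sy)
    step = math.hypot(sx, sy)
    # max() guards against tiny negative values caused by rounding.
    return 0.5 * math.sqrt(max(0.0, (norm_d + norm_shifted) ** 2 - step ** 2))

def elliptical_repulsion(d, v_a, v_b, dt, a, b):
    """Repulsive acceleration with strength a and range b along d."""
    w = semi_minor_axis(d, v_a, v_b, dt)
    norm_d = math.hypot(d[0], d[1])
    magnitude = a * math.exp(-w / b)
    return magnitude * d[0] / norm_d, magnitude * d[1] / norm_d

# Hypothetical example: alpha at the origin, beta one meter ahead on the x axis.
d = (-1.0, 0.0)
same = semi_minor_axis(d, (1.0, 0.0), (1.0, 0.0), dt=0.5)    # equal velocities
head_on = semi_minor_axis(d, (1.0, 0.0), (-1.0, 0.0), dt=0.5)  # approaching
apart = semi_minor_axis(d, (-1.0, 0.0), (1.0, 0.0), dt=0.5)    # separating
print(same, head_on, apart)
```

With equal velocities w reduces to the plain distance. Approaching
pedestrians yield a smaller w and therefore a stronger repulsion, while
separating pedestrians yield a larger w and a weaker one.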



Model C: The third model is an implementation of Rudloff et al. (2011)
in which the repulsive force is split into one force directed opposite
to the walking direction, i.e. the deceleration force, and another one
perpendicular to it, i.e. the evasive force. Here, the repulsive force is given
as

$$ \mathbf{f}^C_{\alpha\beta} = \underbrace{\mathbf{n}_\alpha a_n \exp\left(-\frac{b_n \theta_{\alpha\beta}^2}{v_{rel}} - c_n \|\mathbf{d}_{\alpha\beta}\|\right)}_{\text{deceleration force}} + \underbrace{\mathbf{p}_\alpha a_p \exp\left(-\frac{b_p |\theta_{\alpha\beta}|}{v_{rel}} - c_p \|\mathbf{d}_{\alpha\beta}\|\right)}_{\text{evasive force}} \tag{14} $$

where $\mathbf{n}_\alpha$ is the direction of movement of pedestrian $\alpha$ and $\mathbf{p}_\alpha$ its perpendicular
vector directing away from pedestrian $\beta$. Furthermore, $\theta_{\alpha\beta}$ is the angle
between $\mathbf{n}_\alpha$ and $\mathbf{d}_{\alpha\beta}$, and $v_{rel}$ denotes the relative velocity between pedestrians
$\alpha$ and $\beta$.

We denote the implementations of the three repulsive formulations of the
Social Force model described above as $\mathbf{f}^A_{\alpha\beta}$, $\mathbf{f}^B_{\alpha\beta}$ and $\mathbf{f}^C_{\alpha\beta}$. Note that the
repulsive force from static obstacles $\mathbf{f}_{\alpha i}$ is modeled by using the same
functional form as given by the repulsive force from pedestrians. Here, the point
of an obstacle $i$ closest to pedestrian $\alpha$ replaces the position of $\beta$, and $\mathbf{v}_\beta$ is set
to zero. Furthermore, we take into account that pedestrians have a higher
response to other pedestrians in front of them by including an anisotropic
behavior, as described in Helbing and Johansson (2009), into the first two
formulations.



4.3. Model Calibration

The process of model calibration involves finding parameter values
which produce realistic crowd behavior in the simulation results. We
estimated values for the different parameters in the three described model
approaches $\mathbf{f}^A_{\alpha\beta}$, $\mathbf{f}^B_{\alpha\beta}$ and $\mathbf{f}^C_{\alpha\beta}$ based on our empirical data set from the
walking experiments. The trajectory data were divided into non-overlapping
calibration and validation data sets (validation is described in Section 4.4) as
shown in Table 4.





Number of trajectories    Experiment 1    Experiment 2
Calibration Set           424             1121
Validation Set            134             561
Total                     558             1682


Table 4: Partitioning of the trajectory data set for model calibration and validation.

The literature describes different techniques for calibrating microscopic
simulation models: one way is to estimate parameter values directly from the
trajectory data by extracting pedestrians' accelerations (Hoogendoorn and
Daamen (2006)). However, as shown in Rudloff et al. (2011), this method has
several drawbacks, even with small errors in the trajectories. For instance,
using the acceleration instead of the spatial position introduces significant
noise due to the second derivative. Furthermore, this might lead to
error-in-variables problems and to parameter estimates that are biased towards
zero.

Our calibration uses a simulation approach, where each pedestrian is
simulated separately while keeping the remaining pedestrians on their observed
trajectories. Each simulation run is performed according to the following
procedure: the position and the desired goal for a simulated pedestrian $\alpha$ are
extracted from the start and end point of the associated observed trajectory
$T_\alpha$. The desired speed $v_\alpha^0$ of pedestrian $\alpha$ is defined as the 90th percentile
of the observed velocities. The magnitude of the current velocity vector $\mathbf{v}_\alpha$
is set equal to $v_\alpha^0$, directed towards the pedestrian's desired goal. Pedestrian
$\alpha$ is simulated for $M_\alpha = |T_\alpha|$ timesteps, with the time $t$ running between a
lower and an upper bound that are both again derived from the observed
trajectory.
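The initialization of one simulated pedestrian can be sketched as follows.
This is a minimal illustration under assumptions: trajectories are lists of
(t, x, y) samples with at least three points, speeds are taken between
consecutive samples, and the 90th percentile is computed with linear
interpolation via the Python standard library. The function names and the
dictionary layout are hypothetical.

```python
import math
import statistics

def observed_speeds(trajectory):
    """Speeds between consecutive samples of [(t, x, y), ...]."""
    return [
        math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
        for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:])
    ]

def percentile_90(values):
    """90th percentile: the last of the nine decile cut points."""
    return statistics.quantiles(values, n=10, method="inclusive")[-1]

def initial_state(trajectory):
    """Start position, goal, desired speed and initial velocity of one run."""
    (t_first, x0, y0), (t_last, xg, yg) = trajectory[0], trajectory[-1]
    v0 = percentile_90(observed_speeds(trajectory))
    dist = math.hypot(xg - x0, yg - y0)
    return {
        "position": (x0, y0),
        "goal": (xg, yg),
        "desired_speed": v0,
        # Initial velocity: desired speed, directed towards the goal.
        "velocity": (v0 * (xg - x0) / dist, v0 * (yg - y0) / dist),
        "timesteps": len(trajectory),
        "time_bounds": (t_first, t_last),
    }

# Hypothetical observed trajectory sampled once per second.
state = initial_state(
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.5, 0.0), (3.0, 4.0, 0.0)]
)
print(state)
```

Using a high percentile rather than the mean keeps the desired speed close
to the pace a pedestrian would choose when not slowed down by others.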

After having simulated a set of N pedestrians from the calibration data 
set with the above procedure, a similarity measure s for testing the fit of our 
simulated trajectories can be computed as 



$$ s = \frac{1}{N} \sum_{\alpha=1}^{N} \left( d(\alpha) + q(\alpha) \right) \tag{15} $$

For a pedestrian $\alpha$, the mean Euclidean distance

$$ d(\alpha) = d(T_\alpha, T'_\alpha) = \frac{1}{M_\alpha} \sum_{i=1}^{M_\alpha} \|\mathbf{x}_{\alpha_i} - \mathbf{x}'_{\alpha_i}\| \tag{16} $$

provides the dissimilarity between positions $\mathbf{x}_{\alpha_i} = [t_{\alpha_i}, x_{\alpha_i}, y_{\alpha_i}]^T$ of the
observed trajectory $T_\alpha$ and positions $\mathbf{x}'_{\alpha_i} = [t'_{\alpha_i}, x'_{\alpha_i}, y'_{\alpha_i}]^T$ of the simulated
trajectory $T'_\alpha$. Furthermore, the length of trajectories is defined by $|T_\alpha| = |T'_\alpha| = M_\alpha$.
Since none of the used models explicitly restricts overlapping between
pedestrians, an overlap penalty is added denoted by

$$ q(\alpha) = \frac{1}{N-1} \sum_{\beta \neq \alpha} \sum_{t} \max\left(0, \frac{1}{\|\mathbf{d}_{\alpha\beta}(t)\|} - \frac{1}{r_\alpha + r_\beta}\right) \tag{17} $$
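The mean Euclidean distance between an observed and a simulated trajectory
can be computed as in the following sketch. It covers only the distance
term, not the overlap penalty. The function name is hypothetical, and the
sketch assumes that both trajectories are sampled at the same timestamps, so
that only the spatial coordinates contribute to the distance.

```python
import math

def mean_euclidean_distance(observed, simulated):
    """Mean distance between matching samples of two [(t, x, y), ...] lists."""
    if len(observed) != len(simulated):
        raise ValueError("trajectories must have the same number of samples")
    total = sum(
        math.hypot(xo - xs, yo - ys)
        for (_, xo, yo), (_, xs, ys) in zip(observed, simulated)
    )
    return total / len(observed)

# Hypothetical example: the simulated pedestrian drifts sideways by 0.2 m.
observed = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
simulated = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.2)]
print(mean_euclidean_distance(observed, simulated))
```

A value of zero means that the simulated pedestrian follows the observed
path exactly at every timestep.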
Model parameter values are estimated by applying an optimization
algorithm to find the best possible fit by minimizing the objective function
(15). We use a genetic algorithm, which does not suffer from a starting value
problem, to find the neighborhood of the global minimum. The estimated
parameter values obtained by the genetic algorithm are then used as initial
values for the Nelder-Mead algorithm (see Lagarias et al. (1998)) to refine
the result. This hybrid approach aims at finding the global minimum while
being numerically efficient.
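The two-stage idea can be illustrated with a compact, self-contained sketch:
a crude evolutionary search explores the parameter space, and a simplex
search refines the best candidate. Everything here is an assumption made for
illustration. The evolutionary stage is a simple real-coded variant with
truncation selection, the simplex stage is a simplified Nelder-Mead with
inside contraction only, and toy_objective is a hypothetical stand-in for
the simulation-based objective.

```python
import math
import random

def genetic_search(f, bounds, pop_size=30, generations=40, seed=0):
    """Crude real-coded evolutionary search; returns the best point found."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = []
            for (lo, hi), x, y in zip(bounds, p, q):
                # Uniform crossover followed by Gaussian mutation, clamped to bounds.
                gene = rng.choice((x, y)) + rng.gauss(0.0, 0.05 * (hi - lo))
                child.append(min(hi, max(lo, gene)))
            children.append(child)
        pop = parents + children
    return min(pop, key=f)

def _point_along(centroid, worst, t):
    """Point centroid + t * (centroid - worst) on the reflection line."""
    return [c + t * (c - w) for c, w in zip(centroid, worst)]

def nelder_mead(f, x0, step=0.1, iterations=200):
    """Simplified Nelder-Mead simplex search (bounds are not enforced)."""
    n = len(x0)
    simplex = [list(x0)] + [
        [x + (step if i == j else 0.0) for j, x in enumerate(x0)] for i in range(n)
    ]
    for _ in range(iterations):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = _point_along(centroid, worst, 1.0)
        if f(reflected) < f(best):
            expanded = _point_along(centroid, worst, 2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = _point_along(centroid, worst, -0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:  # shrink all points towards the best one
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)] for p in simplex[1:]
                ]
    return min(simplex, key=f)

def toy_objective(p):
    """Hypothetical objective with several local minima along the first axis."""
    a, b = p
    return (a - 2.0) ** 2 + (b - 0.5) ** 2 + 0.2 * (1.0 - math.cos(4.0 * (a - 2.0)))

coarse = genetic_search(toy_objective, bounds=[(0.0, 5.0), (0.0, 5.0)])
refined = nelder_mead(toy_objective, coarse)
print("coarse:", coarse, "refined:", refined)
```

The refinement never returns a worse point than its starting value, because
the best vertex of the simplex is only ever replaced by a better one.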

4.4. Validation Results

The results for the parameter fit of the individual models are provided in
Table 5 as $s_{cal}$ for the calibration data set and $s_{val}$ for the validation data set.
The best possible value for (15) is $s = 0$. For both experiments, the best fit
of the objective function with the compared modeling approaches could be
achieved using the repulsive formulation $\mathbf{f}^C_{\alpha\beta}$ defined in (14).





          Experiment 1                  Experiment 2
          f^A       f^B       f^C       f^A       f^B       f^C
s_cal     0.0951    0.0887    0.0640    0.0932    0.0925    0.0820
s_val     0.1439    0.0927    0.0826    0.1017    0.0996    0.0929



Table 5: Fit of the parameter values for three different Social Force formulations based on 
calibration and validation data set. 

By applying the three Social Force models to only a small subset of our
validation data set, their basic ability to represent crowd behavior can be
evaluated in a qualitative manner. Figure 11 shows the results of a simulation
run with 19 pedestrians in the setting of experiment 1: the simulation
results of the circular force formulation $\mathbf{f}^A_{\alpha\beta}$ in Figure 11a indicate that
simulated pedestrians evade relatively late, with a strong deceleration caused
by the static person in the center. To avoid running into the obstacle, some
pedestrians even move slightly backward from the obstacle. This collision
avoidance behavior differs significantly from the observed trajectories. As
illustrated in Figure 11b, the walking behavior from the simulations with
$\mathbf{f}^B_{\alpha\beta}$ is less abrupt as a result of the included velocity dependence. However,
pedestrian deceleration is again unrealistically strong when individuals
directly approach the static obstacle. From a qualitative point of view,
simulations using $\mathbf{f}^C_{\alpha\beta}$ exhibit the best results in our comparison
(see Figure 11c). Separating the forces into a deceleration and an evasive
component results in individual trajectories which match very well with the
observations.

For capacity estimations in infrastructures, the walking times of pedestrians
are of particular importance. Accordingly, pedestrian simulation models
need to be able to reproduce realistic walking times even if they are not
specifically calibrated for this purpose. Since the models in this work were
calibrated using the similarity of trajectories as the objective function, we
also want to evaluate their ability to correctly predict the walking time
distribution based on our validation data set. Figure 12 shows the cumulative
distribution functions of walking times $t_w$ derived from measured ($F^M$) and
simulated ($F^A$, $F^B$, $F^C$) trajectories provided by $\mathbf{f}^A_{\alpha\beta}$, $\mathbf{f}^B_{\alpha\beta}$ and $\mathbf{f}^C_{\alpha\beta}$, respectively.
The results for experiment 1 (see Figure 12a) demonstrate that the circular
$F^A$ and elliptical $F^B$ formulations for the repulsive force in the Social Force
model significantly deviate from the measured walking time distribution $F^M$.
However, the formulation used to derive $F^C$ provides a good replication of
the measured walking time distribution $F^M$. In order to support this finding,
we used a two-sample Kolmogorov-Smirnov test (see Massey (1951)) to
compare each walking time distribution from the simulations with the measured
distribution $F^M$. For a significance level of 0.05, we can reject the
null hypothesis that $F^A$ and $F^M$ as well as $F^B$ and $F^M$ are from the same
continuous distribution. However, the null hypothesis cannot be rejected when
comparing $F^C$ and $F^M$.
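A two-sample Kolmogorov-Smirnov comparison of this kind can be sketched with
the standard library alone. The sketch computes the maximum distance between
the two empirical distribution functions and compares it with the usual
large-sample critical value. The function names and the walking times in the
example are hypothetical, and the critical value is an asymptotic
approximation that is less accurate for small samples.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Maximum distance between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d_max = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so that ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d_max = max(d_max, abs(i / len(a) - j / len(b)))
    return d_max

def rejects_same_distribution(sample_a, sample_b, alpha=0.05):
    """True if the null hypothesis of a common distribution is rejected."""
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))  # about 1.358 for 0.05
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical

# Hypothetical walking times in seconds: measured versus two simulated sets.
measured = [4.0 + 0.02 * k for k in range(100)]
simulated_close = [4.01 + 0.02 * k for k in range(100)]
simulated_slow = [6.5 + 0.02 * k for k in range(100)]
print(rejects_same_distribution(measured, simulated_close))
print(rejects_same_distribution(measured, simulated_slow))
```

Failing to reject the null hypothesis does not prove that two distributions
are identical; it only means that the data provide no significant evidence
of a difference at the chosen level.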



5. Conclusion 

In this work we have developed algorithms to use the Microsoft Kinect -
basically a camera that also records 3-dimensional information in the form of
a depth image - for automatic data collection of crowd movement from an
elevated view. We have shown that the use of the Kinect allows the automated
capture of human motion trajectories with high accuracy, overcoming many
limitations of methods that have been applied so far. The scanning area
is scalable by combining multiple Kinects, thus allowing high flexibility for
measurements in different environments. We applied our tracking algorithm
to collect an extensive data set in MIT's Infinite Corridor for calibrating
and comparing three variations of the Social Force model.

In order to capture human motion trajectories throughout the sensing
areas of multiple Kinects, the depth information from individual Kinect
sequences is mapped into a common world coordinate system using a rigid
transformation. Our approach groups depth information from a single Kinect
in the world coordinate system into individual pedestrians based on
hierarchical clustering. These detections are tracked over time to obtain
individual trajectories.

Evaluating the detection performance with two manually annotated ground
truth data sets shows a Pedestrian Detection Rate of 94% and 96%,
respectively. The position error for all correctly tracked objects is quantified
as Multiple Object Tracking Precision and reveals relatively small values of
around 4 cm. In order to observe pedestrians on a larger spatial scale, we
developed methods for combining pedestrian trajectories from multiple Kinect
sensors. Again, we evaluated our trajectory stitching with manually annotated
ground truth data sets and obtained a True Positive Ratio of up to
98%. In conclusion, our tracking approach is capable of delivering trajectories
with an accuracy which we consider sufficient for calibrating microscopic
pedestrian simulation models. In the future our approach could be extended
in order to also estimate the orientation of body parts, i.e. head and shoulder
pose. This would allow us to gain more data on how humans perceive and
interact with their environment, which is particularly useful for evaluating
visual information systems, such as guidance systems or lights.

By applying our tracking approach in two walking experiments performed
under real-world conditions in MIT's Infinite Corridor, we gathered a total
of 2240 trajectories. We compared three variations of the Social Force
model by calibrating them with our trajectory data. The validation results
revealed that collision avoidance behavior in the Social Force model can be
improved by including the relative velocity between individuals. Furthermore,
dividing the repulsive force into a deceleration and an evasion part
delivered the best quantitative and qualitative results out of the investigated
models. However, dividing the repulsive force leads to a larger number of
parameters, which makes the calibration process itself more complex and
computationally expensive. For future work we will increase our data set by
obtaining trajectories under various experimental settings, such as involving
different forms of obstacles. Going forward we believe that the adoption of
the Kinect could be extremely useful for the development and calibration of
crowd models - but also as a tool to better understand human crowd behavior
and hence provide invaluable input to the design of all those spaces that
need to respond to it - starting from our cities.